Wrangling data in R

Leonard Blaschek

“80% of data analysis is cleaning”

— Ancient proverb

A quick word about myself

Types of data

  1. Excel sheets
  2. Delimited text files
  3. Folders of raw data
  4. Insane, lawless text files
  5. Proprietary formats

Excel sheets

Tidy Data

  • No white space
  • One observation per row
  • One variable per column
  • No information in formatting



Spot the untidiness!

R fundamentals

ggplot()            # function
ggplot              # object
996107              # number
"ggplot"            # string
?ggplot()           # show help page 
library(tidyverse)  # use library() to load the tidyverse package

Package vignettes

?readr # navigate to package index and then vignettes

Function help pages

?read_tsv()

Arguments without a default value must be supplied; all others are optional.
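As a small illustration, here is a sketch with a hypothetical function, `scale_by()` (not part of any package). Its argument `x` has no default and has to be supplied, while `factor` falls back to its default when left out.

```r
# Hypothetical function: `x` has no default, `factor` defaults to 2
scale_by <- function(x, factor = 2) {
  x * factor
}

doubled <- scale_by(c(1, 2, 3))               # `factor` uses its default
tenfold <- scale_by(c(1, 2, 3), factor = 10)  # default overridden

# Leaving out `x` raises an error as soon as `x` is used;
# tryCatch() captures the error message instead of stopping the script
missing_x <- tryCatch(scale_by(), error = function(e) conditionMessage(e))
missing_x
```

On a help page, the Usage section shows the same distinction: arguments written as `name = value` have a default, bare names do not.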

The Pipe





library(readr)

# Nested call: read from the inside out
nrow(read_csv("data/cleaned_example.csv"))
#> [1] 16

# Piped call: read from top to bottom
"data/cleaned_example.csv" |>
  read_csv() |>
  nrow()
#> [1] 16

The pipe is typed as either %>% (from the magrittr package, loaded with the tidyverse) or |> (built into R since version 4.1).
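The pipe passes the value on its left as the first argument of the function on its right. A minimal sketch using only base R functions, assuming R 4.1 or newer for the native pipe:

```r
values <- c(1, 4, 9)

# Without the pipe, the first step is buried in the innermost call
nested <- sum(sqrt(values))

# With the pipe, the steps appear in the order they are executed
piped <- values |>
  sqrt() |>
  sum()

identical(nested, piped)
```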

Delimited text files

CSV
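A minimal sketch of reading comma-separated values with `readr::read_csv()`. The column names and values are made up for illustration, and the data is passed as a literal string via `I()` (readr 2.0 or newer) so the example runs without a file.

```r
library(readr)

# I() marks the string as literal data rather than a file path
csv_data <- read_csv(
  I("sample,height\nA,1.5\nB,2.25\n"),
  show_col_types = FALSE
)
csv_data
```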

CSV2
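In this variant, fields are separated by semicolons and the comma serves as the decimal mark, as is common in many European locales. A minimal sketch with `readr::read_csv2()`, again using made-up data passed as a literal string:

```r
library(readr)

# Semicolon-delimited fields, comma as decimal mark
csv2_data <- read_csv2(
  I("sample;height\nA;1,5\nB;2,25\n"),
  show_col_types = FALSE
)
csv2_data$height  # parsed as numbers, not text
```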

TSV
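Tab-separated files work the same way, with `readr::read_tsv()` splitting fields on tabs. A minimal sketch with made-up data:

```r
library(readr)

# "\t" is the tab character separating the fields
tsv_data <- read_tsv(
  I("sample\theight\nA\t1.5\nB\t2.25\n"),
  show_col_types = FALSE
)
tsv_data
```

For any other delimiter, `read_delim()` takes the separator through its `delim` argument.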

Folders of files

list.files()
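A minimal sketch of collecting file paths with `list.files()`. The folder and file names are hypothetical and are created in a temporary directory so the example runs anywhere.

```r
# Set up a throwaway folder with two CSV files and one unrelated file
raw_dir <- file.path(tempdir(), "list_files_example")
dir.create(raw_dir, showWarnings = FALSE)
writeLines("x,y\n1,2", file.path(raw_dir, "plate_1.csv"))
writeLines("x,y\n3,4", file.path(raw_dir, "plate_2.csv"))
writeLines("notes", file.path(raw_dir, "readme.txt"))

# `pattern` is a regular expression; full.names = TRUE returns paths
# that can be handed straight to a read function
csv_files <- list.files(raw_dir, pattern = "\\.csv$", full.names = TRUE)
basename(csv_files)
```

Adding `recursive = TRUE` would also search subfolders.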

Create a read-in function
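One possible shape for such a function, sketched under assumptions: the helper `read_one()`, the file names and the columns are all hypothetical. The function reads a single file and records its origin, and is then applied to every file in the folder.

```r
library(readr)
library(dplyr)
library(purrr)

# Throwaway example files (hypothetical names and columns)
raw_dir <- file.path(tempdir(), "read_in_example")
dir.create(raw_dir, showWarnings = FALSE)
write_csv(tibble(sample = c("A", "B"), height = c(1.5, 2.0)),
          file.path(raw_dir, "day_1.csv"))
write_csv(tibble(sample = c("A", "B"), height = c(1.8, 2.4)),
          file.path(raw_dir, "day_2.csv"))

# Read one file and keep track of where each row came from
read_one <- function(path) {
  read_csv(path, show_col_types = FALSE) |>
    mutate(file = basename(path))
}

# Apply the function to every file, then stack the results
all_data <- list.files(raw_dir, pattern = "\\.csv$", full.names = TRUE) |>
  map(read_one) |>
  bind_rows()
all_data
```

Keeping the file name as a column preserves information that would otherwise live only in the folder structure.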

Data cleaning

Missing values
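A minimal sketch of common ways to handle `NA`, using a made-up table. Whether to drop or fill missing values depends on the data; the replacement value of 0 below is only for illustration.

```r
library(dplyr)
library(tidyr)

measurements <- tibble(
  sample = c("A", "B", "C"),
  height = c(1.5, NA, 2.5)
)

# Most summary functions return NA unless told to skip missing values
mean_with_na <- mean(measurements$height)
mean_height <- mean(measurements$height, na.rm = TRUE)

# Drop rows with a missing height ...
complete_rows <- measurements |> drop_na(height)

# ... or fill the gaps with an explicit value
filled <- measurements |> mutate(height = replace_na(height, 0))
```

When reading files, the `na` argument of the readr functions (for example `na = c("", "NA", "-")`) declares which strings should be treated as missing.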

Long and wide data
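A minimal sketch of converting between the two layouts with `tidyr::pivot_longer()` and `tidyr::pivot_wider()`. The table and its column names are made up.

```r
library(dplyr)
library(tidyr)

# Wide: one column per measurement day
wide <- tibble(
  sample = c("A", "B"),
  day_1 = c(1.5, 2.0),
  day_2 = c(1.8, 2.4)
)

# Long: one row per observation (sample and day)
long <- wide |>
  pivot_longer(
    cols = c(day_1, day_2),
    names_to = "day",
    values_to = "height"
  )

# And back to wide again
wide_again <- long |>
  pivot_wider(names_from = day, values_from = height)
```

The long layout is the tidy one: each row holds a single observation.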

Separating compound variables
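A minimal sketch of splitting one column that encodes several variables, using `tidyr::separate()` on made-up sample IDs. Recent tidyr versions also offer `separate_wider_delim()` for the same job.

```r
library(dplyr)
library(tidyr)

# Hypothetical IDs that combine genotype, tissue and replicate
samples <- tibble(id = c("WT_leaf_1", "WT_root_2", "mut_leaf_1"))

separated <- samples |>
  separate(id, into = c("genotype", "tissue", "replicate"), sep = "_")
separated
```

The new columns are character vectors, so `replicate` still needs converting before it can be used as a number.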

Correcting data classes
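A minimal sketch of fixing column classes after import, assuming a made-up table in which everything arrived as text:

```r
library(dplyr)

raw <- tibble(
  genotype = c("WT", "mut", "WT"),
  height = c("1.5", "2.0", "1.8"),  # numbers stored as text
  replicate = c("1", "2", "3")
)

typed <- raw |>
  mutate(
    genotype = factor(genotype, levels = c("WT", "mut")),  # fixed level order
    height = as.numeric(height),
    replicate = as.integer(replicate)
  )

sapply(typed, class)
```

Alternatively, the `col_types` argument of the readr functions sets the classes at read-in time.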

Data analysis

Grouping
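A minimal sketch of `dplyr::group_by()` on a made-up table. Grouping does not change the data itself; it changes how later verbs treat it.

```r
library(dplyr)

plants <- tibble(
  genotype = c("WT", "WT", "mut", "mut"),
  height_cm = c(10, 20, 4, 6)
)

grouped <- plants |> group_by(genotype)

group_count <- n_groups(grouped)  # number of groups
group_sizes <- grouped |> tally() # rows per group

# Remove the grouping once it is no longer needed
ungrouped <- grouped |> ungroup()
```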

Mutate
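A minimal sketch of `dplyr::mutate()`, which adds or changes columns while keeping every row. The table is made up; the second `mutate()` runs within each group.

```r
library(dplyr)

plants <- tibble(
  genotype = c("WT", "WT", "mut", "mut"),
  height_cm = c(10, 20, 4, 6)
)

with_columns <- plants |>
  mutate(height_mm = height_cm * 10) |>                       # row-wise arithmetic
  group_by(genotype) |>
  mutate(relative_height = height_cm / mean(height_cm)) |>    # per-group mean
  ungroup()
with_columns
```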

Summarise
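A minimal sketch of `dplyr::summarise()`, which collapses each group into a single row. The table and the chosen statistics are only an illustration.

```r
library(dplyr)

plants <- tibble(
  genotype = c("WT", "WT", "mut", "mut"),
  height_cm = c(10, 20, 4, 6)
)

summary_table <- plants |>
  group_by(genotype) |>
  summarise(
    mean_height = mean(height_cm),
    n = n(),           # number of rows in the group
    .groups = "drop"   # return an ungrouped table
  )
summary_table
```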

purrr
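A minimal sketch of the purrr `map()` family, which applies a function to every element of a list or vector. The data is made up, and the `\(x)` shorthand for anonymous functions assumes R 4.1 or newer.

```r
library(purrr)

measurements <- list(WT = c(10, 20), mut = c(4, 6, 8))

# map() always returns a list
ranges <- map(measurements, range)

# Typed variants return an atomic vector of the stated type
means <- map_dbl(measurements, mean)
labels <- map_chr(measurements, \(x) paste(length(x), "values"))
```

The typed variants fail loudly if a result does not match the expected type, which helps catch problems early.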

When you’re stuck

  1. Know which package/function you need? Read the help pages and vignettes!
  2. Know what you want to do but not where to start? Try an LLM, e.g. perplexity.ai
  3. Feel like you’ve done this before? Keep your old scripts organised and annotated. Chances are you’ll need that little hack you came up with again in a month or two.

Exercises!

Open up 2023_ggplot2_exercises.rmd and give it a try

Resources to go further